Previous Steps:
I found a website called “Mathematics Genealogy Project” from NDSU and it seemed like an interesting set of data to work with.
It is a database of people who have received as doctorate (or similar) in mathematics. This website includes these data points for a given individual:
Their complete name - Their degrees/dissertations For each university/dissertation: - The name of the university - The year(s) in which their degree was awarded - The complete title of the dissertation - The complete names of their advisors
They have collected this data from a variety of historical records and data submitted to them. It is presented on a fairly antiquated PHP website with an individual page for each mathematician that are (luckily) given incremental IDs.
From their FAQ:
Where do you get your data?We depend on information from our visitors for most of our data. In cases of partial information, we search Dissertation Abstracts International in an effort to find complete information. We have also entered a considerable amount of data found on lists of graduates maintained by individual departments.
I scrapped all the pages on the site (253772 of them) and then converted them into a dataframe.
The scrapping process involved using curl to download all of the pages and then going through them with rvest to get the important information into a dataframe. The most relevant code is below.
Downloading all the pages:
curl -OZ https://www.genealogy.math.ndsu.nodak.edu/id.php?id=[1-253772]
The function that builds the datatable:
parse.id = function(addr) {
return(sub(".*id.php\\?id=", '', addr) %>% strtoi)
}
process.indices <- function(ind) {
paste("site-dump-1/id.php_id=", ind, sep = "") %>% map(read_html) %>% map(html_node, "#paddingWrapper") %>% {
tibble(
id = ind,
name = map(., html_node, "h2") %>% map_chr(html_text, trim = TRUE),
schools = map(., ~ {
tibble (
thesis = html_nodes(.x, "#thesisTitle") %>% html_text(trim = TRUE),
university = html_nodes(.x, xpath = "//span[contains(@style,'#006633')]") %>% html_text(),
advisors = html_nodes(.x, xpath = '//p[contains(@style, "text-align: center; line-height: 2.75ex")]|//p[text()[contains(.,"Advisor")]]') %>% map(html_nodes, "a") %>% map(html_attr, "href") %>% map(map, parse.id),
countries = html_nodes(.x, xpath = "//div[contains(@style, 'line-height: 30px; text-align: center; margin-bottom: 1ex')]") %>% map(html_nodes, "img") %>% map(map, html_attr, "title")
)
}),
advisors = map(., html_nodes, xpath = '//p[contains(@style, "text-align: center; line-height: 2.75ex")]|//p[text()[contains(.,"Advisor")]]') %>% map(html_nodes, "a") %>% map(html_attr, "href") %>% map(map, parse.id),
students.raw = map(., html_node, "table") %>% map_if(~ class(.) != "xml_missing" ,html_table),
students.id = map(., html_node, "table") %>% map(html_nodes, "a") %>% map(html_attr, "href") %>% map(map, parse.id)
)
}
}
Processing the pages into the dataframe was very challenging. There were many variations in just a few of the pages that were hard to find and account for. The HTML of the pages themselves is not well structured and has very little hierarchy or tagging.
Some of the issues I ran into were:
Some of the data I would have liked to get was not practical to scrape into the dataframe. The year of each thesis is not well formatted and difficult to separate out of the page. There are sometimes multiple of them separated by commas or slashes.
After getting the data into R, I did some analysis on it as a dataframe and then converted it into a dataframe of nodes and one of edges so I could use the igraph library to look at the connections between people and visualize the structure of the graph.
I attempted to use the even more powerful Neo4j but it ended up with a lot of issues and I could not get the data into it.
Looking at some of the basic numerical information, most people had 1 advisor and it was very rare to have 3 or more. Additionally, the majority of people have no students. This makes sense as professors are a relatively small portion of the population. There is some spread in number of students advised, it goes to as many as 140, but very few professors have more than a few students they have advised.
We find that the majority of people in the database studied in the United States, with Germany and France having the next most.
I looked at some of the statistics that can be applied to graphs. Though many of the things in the library don’t work very well on such a large graph and would crash or hang forever.
One of those was looking at cliques, groups of nodes that are fully interconnected. I found that the largest cliques formed are of 4 nodes and there are 17 of them. In other words, there are 17 groups where every person has some connection to the other 3.
Here is a graph of 5 generations of “descendants” of Euler:
This is a very interesting data set but also large and difficult to work with. This project ended up being mostly wrangling the data into a useable form and I think I have accomplished that fairly well. The data has gone from 253772 individual pages on a website to an R dataframe and a working graph model. I wish I had been able to get it into Neo4j to be able to do more robust queries and analysis.